Module 3: Introduction
University of South Florida
Artificial Intelligence (AI): intelligent machines
These machines are called intelligent because:
They have decision-making capabilities
like human beings (i.e., they are “smart”)
Machine Learning (ML): a subset of AI
Algorithms that make the computer learn
Some come from statistics, others from computer science
Neural networks (deep learning) are one such family of algorithms
Find patterns in \(X\) and “cluster” the observations that share similar patterns.
Principal Component Analysis (PCA)
Hierarchical clustering and K-means clustering
Latent Dirichlet Allocation (LDA)
Neural Network (e.g., autoencoder)
Unsupervised learning is useful for
Descriptive purpose (patterns)
Garnering insights from data
Dimensionality reduction (feature selection)
Stock (Fund) clustering for portfolio
Clustering
Transaction Anomaly detection
Sentiment analysis (topic discovery)
Preprocessing step
You can tell by asking whether the data include a ground-truth variable (Y) in addition to the predictor (X) variables.
Group investors by the trading patterns of their brokerage accounts
Cluster credit card transactions by time, location, amount, and frequency
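A transaction-clustering task like the one above can be sketched with base R's `kmeans`. The data below are entirely made up for illustration (two hypothetical groups: small daytime purchases vs. large late-night ones); only the workflow, scaling the features and then clustering, is the point.

```r
# Minimal k-means sketch on simulated transaction features.
set.seed(42)
txns <- data.frame(
  amount = c(rnorm(50, mean = 20,  sd = 5),   # small purchases
             rnorm(50, mean = 500, sd = 80)), # large purchases
  hour   = c(rnorm(50, mean = 13, sd = 2),    # daytime
             rnorm(50, mean = 2,  sd = 1))    # late night
)
# Scale first so `amount` does not dominate the distance metric
km <- kmeans(scale(txns), centers = 2, nstart = 25)
table(km$cluster)  # sizes of the two discovered clusters
```

Note that no ground-truth labels are used anywhere: the algorithm discovers the grouping from \(X\) alone.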
Unsupervised learning problems can be grouped into three types:
Clustering
Association
Dimensionality reduction
The objective of a supervised learning method is:
given data X and truth values Y, find the best predictor model f such that
\[ Y \sim f(X) \]
Linear regressions
Logistic regressions (categorical)
Decision Trees
Boosted Trees
Neural network
Supervised learning can be grouped into regression and classification problems.
When the predicted variable, Y, is:
Continuous variable: Regression
Discrete variable: Classification
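The regression/classification split maps directly onto model choice. A minimal sketch with simulated data (all names and values below are illustrative): the same predictor drives a continuous target, fit with `lm`, and a binary target, fit with logistic regression via `glm`.

```r
# One predictor, two kinds of targets.
set.seed(1)
df <- data.frame(x = 1:20)
df$y_cont <- 2 * df$x + rnorm(20)                       # continuous Y
df$y_bin  <- as.integer(df$x + rnorm(20, sd = 5) > 10)  # noisy binary Y

reg <- lm(y_cont ~ x, data = df)                      # regression
clf <- glm(y_bin ~ x, data = df, family = binomial)   # classification
```

Same \(Y \sim f(X)\) template in both cases; only the type of Y (continuous vs. discrete) changes the method.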
Regression / Classification, or Clustering / Association?
Simplest splitting scheme.
We need data to train/fit parameters for the specified model.
Then we need to validate the model performance with “unseen” data.
ML operations involve the hyperparameter tuning process.
Hyperparameters are parameters in the configuration of the model that are NOT LEARNED from the model but set prior to the training process. They govern the performance of the ML and also the training time.
Without validation data, the test results can be biased toward a specific hyperparameter setting.
Set aside one more dataset for a further validation step
Typically 70/15/15
Pick the model with the best performance on the validation set, and do a final check on the test set
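The 70/15/15 split can be done in a few lines of base R. This is a sketch on a placeholder data frame `df`; the row indices are shuffled once, then sliced into the three parts.

```r
# 70/15/15 train/validation/test split (placeholder data).
set.seed(1)
df  <- data.frame(x = rnorm(100), y = rnorm(100))
n   <- nrow(df)
idx <- sample(n)  # shuffle row indices once

train <- df[idx[1:round(0.70 * n)], ]
valid <- df[idx[(round(0.70 * n) + 1):round(0.85 * n)], ]
test  <- df[idx[(round(0.85 * n) + 1):n], ]

c(nrow(train), nrow(valid), nrow(test))  # 70 15 15
</code>
```

Shuffling before slicing matters: slicing the raw rows would leak any ordering in the data (e.g., by date) into the split.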
Instead of a three-part split (train-valid-test), a more robust technique is used to assess ML performance:
k-fold CV
Divide train data into k equally-sized folds
Each fold is once used as validation set
Average out the performance by each hyperparameter setting
Pick the best model
Final check on test (this hold-out set is optional)
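The steps above can be hand-rolled in base R without any packages. This sketch estimates the out-of-fold MSE of a simple linear model on simulated data; in practice each hyperparameter setting would get its own averaged score, and the best one wins.

```r
# Minimal k-fold cross-validation sketch (simulated data).
set.seed(1)
df   <- data.frame(x = rnorm(100))
df$y <- 3 * df$x + rnorm(100)

k    <- 5
fold <- sample(rep(1:k, length.out = nrow(df)))  # random fold labels
mse  <- numeric(k)

for (i in 1:k) {
  fit     <- lm(y ~ x, data = df[fold != i, ])         # train on k-1 folds
  pred    <- predict(fit, newdata = df[fold == i, ])   # predict held-out fold
  mse[i]  <- mean((df$y[fold == i] - pred)^2)
}
mean(mse)  # cross-validated MSE estimate
```

Every observation is used for validation exactly once, which is what makes the averaged score more stable than a single train/valid split.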
How can we determine the “quality” of ML model?
What makes one ML model “better”?
Unsupervised algorithms have no ground TRUTH against which to measure accuracy. Therefore, the bias/variance tradeoff discussion applies to supervised algorithms.
Bias is the prediction error of a model.
It is especially related to the training-set prediction error.
Regression: \(\frac{1}{n}\sum\limits_{i = 1}^{n} (f(X_i) -Y_i)^2\)
Classification: \(\frac{1}{n}\sum\limits_{i = 1}^{n} 1_{f(X_i) \neq Y_i}\)
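Both formulas translate directly into one-liners in R. The vectors below are toy values for illustration; `y_hat` plays the role of \(f(X_i)\).

```r
# Regression error: mean squared error on the training set.
y     <- c(1.0, 2.0, 3.0)
y_hat <- c(1.1, 1.9, 3.2)
mse   <- mean((y_hat - y)^2)

# Classification error: share of misclassified observations.
cls     <- c("a", "b", "b")
cls_hat <- c("a", "b", "a")
err     <- mean(cls_hat != cls)  # 1 of 3 wrong -> 1/3
```

`mean((...)^2)` and `mean(... != ...)` are exactly the \(\frac{1}{n}\sum\) terms in the two formulas.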
Low bias: predicted data points are close to the target
High bias: predicted data points are far from the target.
Based on the training set, we have three linear models:
In general,
Higher complexity, higher accuracy
Higher complexity, lower interpretability (black-boxy)
If we attain low bias (high accuracy) on our training data:
We care about the model’s generalizability.
Therefore:
We must test the model’s accuracy with new data
that was not used for training the model
Variance quantifies the sensitivity of the parameter estimates to fluctuations in the training data.
It checks how “reliable” the model is out of sample.
High variance: the model does not generalize well
Low variance: the model performance is reliable when outside data is given
If we use the trained model for prediction on a new dataset:
A large drop in test accuracy compared to training
Classic overfitting problem
Not reliable to unseen data
# Plot fitted line on top of train set
base_plot <- train |>
ggplot(aes(x = Age, y = Salary)) +
geom_point() +
theme_bw()
plot_linear <- base_plot +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
plot_quad <- base_plot +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)
plot_poly <- base_plot +
geom_smooth(method = "lm", formula = y ~ poly(x, 5), se = FALSE)
plot_linear
plot_quad
plot_poly

What is the train accuracy (using \(R^2\) as metric)?
# A tibble: 10 × 5
Age Salary Predicted_sal_1 Predicted_sal_2 Predicted_sal_3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 30 166000 165980. 160184. 117293.
2 26 78000 150670. 113172. 111072.
3 58 310000 273144. 271282. 245230.
4 29 100000 162152. 149161. 104937.
5 40 260000 204253. 243654. 284898.
6 27 150000 154498. 125655. 99725.
7 33 140000 177461. 190334. 173238.
8 61 220000 284626. 260559. 257932.
9 27 86000 154498. 125655. 99725.
10 48 276000 234871. 275396. 277161.
\[ R^2 = 1- \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{SSR}{SST} \]
Calculate R2
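The formula above is a one-line helper in R. The call commented out below assumes a `test` data frame shaped like the tibble shown earlier (columns `Salary` and `Predicted_sal_1`–`Predicted_sal_3`); the last line is a self-contained sanity check.

```r
# R^2 following the formula above: 1 - SSR/SST.
r_squared <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

# With the earlier test set and predictions, e.g.:
# r_squared(test$Salary, test$Predicted_sal_1)

r_squared(c(1, 2, 3, 4), c(1, 2, 3, 4))  # perfect fit -> 1
```

Note \(R^2\) can go negative out of sample: a model that predicts worse than the test-set mean \(\bar{y}\) has SSR > SST.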
base_plot <- test |>
ggplot(aes(x = Age)) +
theme_bw() +
geom_point(aes(y = Salary), color = "black")
model1_plot <- base_plot +
geom_point(aes(y = Predicted_sal_1), color = "blue4") +
geom_smooth(
aes(y = Predicted_sal_1),
method = "lm",
formula = y ~ x,
color = "blue4",
se = FALSE
)
model2_plot <- base_plot +
geom_point(aes(y = Predicted_sal_2), color = "green4") +
geom_smooth(
aes(y = Predicted_sal_2),
method = "lm",
formula = y ~ poly(x, 2),
color = "green4",
se = FALSE
)
model3_plot <- base_plot +
geom_point(aes(y = Predicted_sal_3), color = "red4") +
geom_smooth(
aes(y = Predicted_sal_3),
method = "lm",
formula = y ~ poly(x, 5),
color = "red4",
se = FALSE
)
model1_plot

Using the above train / test data, fit models with 3rd and 4th order polynomial regressions:
\[ Salary = \beta_0 + \beta_1 Age + \beta_2 Age^2 + \beta_3 Age^3 + \epsilon \]
\[ Salary = \beta_0 + \beta_1Age + \beta_2Age^2 + \beta_3Age^3 + \beta_4Age^4 + \epsilon \]
Report train accuracy (\(R^2\)) and test accuracy. Which model performs better?
Visualize your work, and use Quarto to render an .html report.
John C. Hull “Machine Learning in Business”
FIN6776: Big Data and Machine Learning in Finance